This notebook is the second in a series of steps for running machine learning on the cloud. In this step, we use the data and the associated analysis metadata prepared in the previous notebook to train a model.
In [8]:
import google.datalab as datalab
import google.datalab.ml as ml
import mltoolbox.regression.dnn as regression
import os
import time
The storage bucket was created in the previous notebook. We'll re-declare it here so we can use it.
In [3]:
storage_bucket = 'gs://' + datalab.Context.default().project_id + '-datalab-workspace/'
storage_region = 'us-central1'
workspace_path = os.path.join(storage_bucket, 'census')
In [4]:
!gsutil ls -r {workspace_path}/data
In [5]:
train_data_path = os.path.join(workspace_path, 'data/train.csv')
eval_data_path = os.path.join(workspace_path, 'data/eval.csv')
schema_path = os.path.join(workspace_path, 'data/schema.json')
train_data = ml.CsvDataSet(file_pattern=train_data_path, schema_file=schema_path)
eval_data = ml.CsvDataSet(file_pattern=eval_data_path, schema_file=schema_path)
In [6]:
analysis_path = os.path.join(workspace_path, 'analysis')
In [7]:
!gsutil ls {analysis_path}
Training in the cloud is accomplished by submitting jobs to Cloud Machine Learning Engine. When submitting jobs, it is a good idea to name each job so it can be looked up easily (names must be unique within the scope of a project).
Additionally, you'll want to pick the region where your job will run; usually this is the same region where your training data resides.
Finally, you'll want to pick a scale tier. The documentation describes the different scale tiers and custom cluster setups you can use with ML Engine. For the purposes of this sample, a simple single-node cluster suffices.
In [20]:
config = ml.CloudTrainingConfig(region=storage_region, scale_tier='BASIC')
training_job_name = 'census_regression_' + str(int(time.time()))
training_path = os.path.join(workspace_path, 'training')
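The scale tier is just a field on the config object, so moving to a larger predefined cluster later only requires changing that value. The cell below is a hedged sketch: the variable name larger_config is illustrative, and STANDARD_1 is one of ML Engine's predefined tiers (see the scale tier documentation for the current list).
In [ ]:
# Illustrative alternative only -- the rest of this notebook continues to use
# the BASIC config defined above.
larger_config = ml.CloudTrainingConfig(region=storage_region, scale_tier='STANDARD_1')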
In [21]:
features = {
"WAGP": {"transform": "target"},
"SERIALNO": {"transform": "key"},
"AGEP": {"transform": "embedding", "embedding_dim": 2}, # Age
"COW": {"transform": "one_hot"}, # Class of worker
"ESP": {"transform": "embedding", "embedding_dim": 2}, # Employment status of parents
"ESR": {"transform": "one_hot"}, # Employment status
"FOD1P": {"transform": "embedding", "embedding_dim": 3}, # Field of degree
"HINS4": {"transform": "one_hot"}, # Medicaid
"INDP": {"transform": "embedding", "embedding_dim": 5}, # Industry
"JWMNP": {"transform": "embedding", "embedding_dim": 2}, # Travel time to work
"JWTR": {"transform": "one_hot"}, # Transportation
"MAR": {"transform": "one_hot"}, # Marital status
"POWPUMA": {"transform": "one_hot"}, # Place of work
"PUMA": {"transform": "one_hot"}, # Area code
"RAC1P": {"transform": "one_hot"}, # Race
"SCHL": {"transform": "one_hot"}, # School
"SCIENGRLP": {"transform": "one_hot"}, # Science
"SEX": {"transform": "one_hot"},
"WKW": {"transform": "one_hot"} # Weeks worked
}
NOTE: To facilitate re-running this notebook, any previous training outputs are first deleted if they exist.
In [ ]:
!gsutil rm -rf {training_path}
NOTE: The job submitted below can take a few minutes to complete. Once you have submitted the job, you can continue with the next steps in the notebook until the call to job.wait().
In [22]:
job = regression.train_async(train_dataset=train_data, eval_dataset=eval_data,
                             features=features,
                             analysis_dir=analysis_path,
                             output_dir=training_path,
                             max_steps=2000,
                             layer_sizes=[5, 5, 5],
                             job_name=training_job_name,
                             cloud=config)
When a job is submitted to ML Engine, a few things happen. The code for the job is staged in Google Cloud Storage, and a job definition is submitted to the service.
The service queues the job, and from that point it can be monitored in the console (status and logs), as well as with TensorBoard. The service also provisions compute resources based on the chosen scale tier, installs your code package and its dependencies, and starts your training process. It then monitors the job to completion, retrying it if necessary.
The first step in the process, launching a training cluster, can take a few minutes. It is recommended to first validate jobs on the BASIC tier, which benefits from quicker job starts and allows faster iteration, and then launch larger-scale jobs once the overhead of launching a cluster is small relative to the lifetime of the job itself.
You can check the progress of the job using the link to the console page above, as well as its logs.
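If the gcloud command-line tool is available in the notebook environment, you can also inspect the job from the command line. The cell below is a hedged sketch: it assumes gcloud is installed and authenticated against the same project, and it uses the job name generated earlier.
In [ ]:
# Show the current state of the submitted job.
!gcloud ml-engine jobs describe {training_job_name}

# Optionally stream the job's logs into the notebook (interrupt the cell to stop).
# !gcloud ml-engine jobs stream-logs {training_job_name}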
In [ ]:
tensorboard_pid = ml.TensorBoard.start(training_path)
In [23]:
# Wait for the job to be complete before proceeding.
job.wait()
In [18]:
!gsutil ls -r {training_path}/model
In [16]:
ml.TensorBoard.stop(tensorboard_pid)
Once a model has been trained, the next step is to evaluate it, possibly against multiple evaluation datasets. We'll continue with this step in the next notebook.